The Titanic prediction model aims to predict whether a passenger survived or did not survive the sinking of the Titanic, based on various features provided in the dataset. This dataset is commonly used for binary classification tasks in machine learning.
The dataset used for this prediction model is available on Kaggle and can be found at the following link: Titanic - Machine Learning from Disaster.
Here is a brief description of the columns in the dataset:

get_user_input function prompts the user to input values for each feature.predict_survival function takes the input data and uses the trained model to predict the survival status of a passenger.# Loading, Preprocessing, Analysis Libraries
import pandas as pd
import numpy as np
# Visulaiztion Libraries
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
%matplotlib inline
# Model Training And Testing libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler,StandardScaler, LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score, precision_recall_curve, auc, confusion_matrix, recall_score, accuracy_score, precision_score, f1_score, classification_report, f1_score, roc_curve
# Models Algorithms
from sklearn.ensemble import AdaBoostClassifier, HistGradientBoostingClassifier, RandomForestClassifier, GradientBoostingClassifier
# Profiling Libraries
from ydata_profiling import ProfileReport
titanic = pd.read_csv(r"D:\Projects\Python\CodeSoft Internship\Titanic_Survival Project\Titanic-Dataset.csv")
ProfileReport(titanic, title='Titanic Dataset', explorative=True).to_file(r"D:\Projects\Python\CodeSoft Internship\Titanic_Survival Project\Titanic-Dataset_Profile.html")
Summarize dataset: 0%| | 0/5 [00:00<?, ?it/s]
Generate report structure: 0%| | 0/1 [00:00<?, ?it/s]
Render HTML: 0%| | 0/1 [00:00<?, ?it/s]
Export report to file: 0%| | 0/1 [00:00<?, ?it/s]
titanic.head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
titanic.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 891 entries, 0 to 890 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 PassengerId 891 non-null int64 1 Survived 891 non-null int64 2 Pclass 891 non-null int64 3 Name 891 non-null object 4 Sex 891 non-null object 5 Age 714 non-null float64 6 SibSp 891 non-null int64 7 Parch 891 non-null int64 8 Ticket 891 non-null object 9 Fare 891 non-null float64 10 Cabin 204 non-null object 11 Embarked 891 non-null object dtypes: float64(2), int64(5), object(5) memory usage: 83.7+ KB
titanic.shape
(891, 12)
titanic.columns
Index(['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare',
'Embarked'],
dtype='object')
titanic.describe()
| PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | |
|---|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 13.002015 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 223.500000 | 0.000000 | 2.000000 | 22.000000 | 0.000000 | 0.000000 | 7.910400 |
| 50% | 446.000000 | 0.000000 | 3.000000 | 29.699118 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 668.500000 | 1.000000 | 3.000000 | 35.000000 | 1.000000 | 0.000000 | 31.000000 |
| max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
titanic.select_dtypes(include=['object']).describe()
| Name | Sex | Ticket | Cabin | Embarked | |
|---|---|---|---|---|---|
| count | 891 | 891 | 891 | 204 | 891 |
| unique | 891 | 2 | 681 | 147 | 3 |
| top | Braund, Mr. Owen Harris | male | 347082 | B96 B98 | S |
| freq | 1 | 577 | 7 | 4 | 645 |
def percent_counts(df, feature):
total = df[feature].value_counts(dropna=False)
percent = round(df[feature].value_counts(dropna=False, normalize=True) * 100, 2)
percent_count = pd.concat([total, percent], keys=['Total', 'Percentage'], axis=1)
return percent_count
percent_counts(titanic, 'Embarked')
| Total | Percentage | |
|---|---|---|
| S | 645 | 72.39 |
| C | 168 | 18.86 |
| Q | 78 | 8.75 |
percent_counts(titanic, 'Embarked')
percent_counts(titanic, 'Sex')
| Total | Percentage | |
|---|---|---|
| male | 577 | 64.76 |
| female | 314 | 35.24 |
percent_counts(titanic, 'Cabin')
| Total | Percentage | |
|---|---|---|
| NaN | 687 | 77.10 |
| C23 C25 C27 | 4 | 0.45 |
| G6 | 4 | 0.45 |
| B96 B98 | 4 | 0.45 |
| C22 C26 | 3 | 0.34 |
| ... | ... | ... |
| E34 | 1 | 0.11 |
| C7 | 1 | 0.11 |
| C54 | 1 | 0.11 |
| E36 | 1 | 0.11 |
| C148 | 1 | 0.11 |
148 rows × 2 columns
percent_counts(titanic, 'Survived')
| Total | Percentage | |
|---|---|---|
| 0 | 549 | 61.62 |
| 1 | 342 | 38.38 |
percent_counts(titanic, 'Pclass')
| Total | Percentage | |
|---|---|---|
| 3 | 491 | 55.11 |
| 1 | 216 | 24.24 |
| 2 | 184 | 20.65 |
percent_counts(titanic, 'SibSp')
| Total | Percentage | |
|---|---|---|
| 0 | 608 | 68.24 |
| 1 | 209 | 23.46 |
| 2 | 28 | 3.14 |
| 4 | 18 | 2.02 |
| 3 | 16 | 1.80 |
| 8 | 7 | 0.79 |
| 5 | 5 | 0.56 |
continuous_values = []
categorical_values = []
for column in titanic.columns:
if titanic[column].dtype == 'int64' or titanic[column].dtype == 'float64':
continuous_values.append(column)
else:
categorical_values.append(column)
print("Continuous values: ", continuous_values)
print("Categorical values: ", categorical_values)
Continuous values: ['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare'] Categorical values: ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']
sns.countplot(x='Pclass', data=titanic)
plt.xlabel('Pclass')
plt.ylabel('Count')
plt.xticks([0, 1, 2], ['First', 'Second', 'Third'])
plt.title('Pclass Count')
plt.show()
sns.countplot(x='Sex', data=titanic, palette=colors)
plt.xlabel('Sex')
plt.ylabel('Count')
plt.xticks([0, 1], ['Female', 'Male'])
plt.title('Sex Count')
plt.show()
sns.countplot(x='SibSp', data=titanic, palette=colors)
plt.xlabel('SibSp')
plt.ylabel('Count')
plt.xticks([0, 1, 2, 3], ['0', '1', '2', '3'])
plt.title('SibSp Count')
plt.show()
titanic['Embarked'].value_counts()
S 646 C 168 Q 77 Name: Embarked, dtype: int64
sns.countplot(x='Embarked', data=titanic, palette=colors)
plt.xlabel('Embarked')
plt.ylabel('Count')
plt.xticks([0, 1, 2], ['Southampton', 'Cherbourg', 'Queenstown'])
plt.title('Embarked Count')
plt.show()
titanic['Fare'].value_counts()
65.6344 116
8.0500 43
13.0000 42
7.8958 38
7.7500 34
...
6.8583 1
34.6542 1
12.6500 1
12.0000 1
10.5167 1
Name: Fare, Length: 204, dtype: int64
plt.figure(figsize=(10, 6))
sns.histplot(titanic['Fare'], bins=30, kde=False, color= colors[0])
plt.title('Distribution of Fares')
plt.xlabel('Fare')
plt.ylabel('Frequency')
plt.show()
sns.histplot(data= titanic, x='Age', bins=30, kde=True)
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
df_corr= titanic[continuous_values].corr()
df_corr
| PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | |
|---|---|---|---|---|---|---|---|
| PassengerId | 1.000000 | -0.005007 | -0.035144 | 0.035533 | -0.072778 | NaN | 0.003243 |
| Survived | -0.005007 | 1.000000 | -0.338481 | -0.065857 | 0.031434 | NaN | 0.317430 |
| Pclass | -0.035144 | -0.338481 | 1.000000 | -0.330962 | 0.023180 | NaN | -0.715300 |
| Age | 0.035533 | -0.065857 | -0.330962 | 1.000000 | -0.251585 | NaN | 0.137498 |
| SibSp | -0.072778 | 0.031434 | 0.023180 | -0.251585 | 1.000000 | NaN | 0.349615 |
| Parch | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Fare | 0.003243 | 0.317430 | -0.715300 | 0.137498 | 0.349615 | NaN | 1.000000 |
plt.figure(figsize=(19,7))
sns.heatmap(df_corr, annot = True)
plt.title('Correlation Matrix of Continuous Variables')
plt.show()
# Assuming titanic DataFrame is already defined
df1 = titanic.copy(deep=True)
# Select only numeric columns
numeric_cols = df1.select_dtypes(include=['number']).columns
# Calculate correlations with the 'Survived' column
corr = df1[numeric_cols].drop('Survived', axis=1).corrwith(df1['Survived'], numeric_only=True).sort_values(ascending=False).to_frame()
corr.columns = ['Correlations']
# Plot the heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, linewidths=0.4, linecolor='black', fmt='.2f', cmap='coolwarm')
plt.title('Correlation with Survived')
plt.show()
titanic['Survived'].value_counts()
0 549 1 342 Name: Survived, dtype: int64
colors = ["#8B0000", "#FFDAB9", "#8B008B"]
# Calculate the percentage of survival and not survival
l = list(titanic['Survived'].value_counts())
circle = [l[1] / sum(l) * 100, l[0] / sum(l) * 100]
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(12, 5)) # Adjust figsize here
# Pie chart
plt.subplot(1, 2, 1)
plt.pie(circle, labels=['Survived', 'Not Survived'], autopct='%1.1f%%', startangle=90,
explode=(0.1, 0), colors=colors[:2], wedgeprops={'edgecolor': 'black', 'linewidth': 1, 'antialiased': True})
plt.title('Survival %')
# Count plot
plt.subplot(1, 2, 2)
sns.countplot(x='Survived', data=titanic, palette=colors[:2])
plt.xlabel('Survival')
plt.ylabel('Count')
plt.xticks([0, 1], ['Not Survived', 'Survived'])
plt.title('Cases of Survival')
plt.tight_layout() # Adjust subplot parameters to give specified padding
plt.show()
# Prepare data
p2 = titanic.groupby(['Survived', 'Sex']).size().unstack(fill_value=0)
p3 = titanic.groupby(['Survived', 'Pclass']).size().unstack(fill_value=0)
p4 = titanic.groupby(['Survived', 'Embarked']).size().unstack(fill_value=0)
# Create figures
fig = make_subplots(rows=1, cols=1, subplot_titles=("Gender Distribution"), horizontal_spacing=0.05, vertical_spacing=0.05)
fig3 = make_subplots(rows=1, cols=1, subplot_titles=("Pclass Distribution"), horizontal_spacing=0.05, vertical_spacing=0.05)
fig4 = make_subplots(rows=1, cols=1, subplot_titles=("Embarked Distribution"), horizontal_spacing=0.05, vertical_spacing=0.05)
# Plot 1 - Gender Distribution
colors2 = ['#646782', '#CDD5DE']
for i, gender in enumerate(p2.columns):
fig.add_trace(go.Bar(x=p2.index, y=p2[gender], name=gender, marker_color=colors2[i]), row=1, col=1)
# Plot for Pclass Distribution
colors3 = ['#FF9999', '#66CCCC', '#339966']
for i, pclass_type in enumerate(p3.columns):
fig3.add_trace(go.Bar(x=p3.index, y=p3[pclass_type], name=pclass_type, marker_color=colors3[i]), row=1, col=1)
# Plot for Embarked Distribution
colors4 = ['#FFA07A', '#20B2AA', '#778899', '#8A2BE2'] # Define colors4
for i, embarked_type in enumerate(p4.columns):
fig4.add_trace(go.Bar(x=p4.index, y=p4[embarked_type], name=embarked_type, marker_color=colors4[i]), row=1, col=1)
# Update layout for Gender Distribution
fig.update_layout(showlegend=True, barmode='group', bargap=0.15, legend_title_text="Gender", height=400, width=800, plot_bgcolor='rgba(255, 255, 255, 0.7)')
fig.update_xaxes(title_text="Survival", row=1, col=1)
fig.update_yaxes(title_text="Frequency", row=1, col=1)
# Update layout for Pclass Distribution
fig3.update_layout(showlegend=True, barmode='group', bargap=0.15, legend_title_text="Pclass Type", height=400, width=800, plot_bgcolor='rgba(255, 255, 255, 0.7)')
fig3.update_xaxes(title_text="Survival", row=1, col=1)
fig3.update_yaxes(title_text="Frequency", row=1, col=1)
# Update layout for Embarked Distribution
fig4.update_layout(showlegend=True, barmode='group', bargap=0.15, legend_title_text="Embarked", height=400, width=800, plot_bgcolor='rgba(255, 255, 255, 0.7)')
fig4.update_xaxes(title_text="Survival", row=1, col=1)
fig4.update_yaxes(title_text="Frequency", row=1, col=1)
# Show plots
fig.show()
fig3.show()
fig4.show()
# Calculate the value counts for 'Embarked' and 'Sex' and reset the index
Em_sex = titanic[['Embarked', 'Sex']].value_counts().reset_index(name='count')
Em_sex
| Embarked | Sex | count | |
|---|---|---|---|
| 0 | S | male | 441 |
| 1 | S | female | 205 |
| 2 | C | male | 95 |
| 3 | C | female | 73 |
| 4 | Q | male | 41 |
| 5 | Q | female | 36 |
plt.figure(figsize=(7,6))
sns.barplot(data=Em_sex , x=Em_sex['Embarked'], y=Em_sex['count'], hue=Em_sex['Sex'])
plt.title('Embarked & Sex Frequency')
plt.xlabel('Embarked')
plt.ylabel('Frequency')
plt.show()
sv_sibling = titanic[['Survived', 'SibSp']].value_counts().reset_index(name='count')
sv_sibling
| Survived | SibSp | count | |
|---|---|---|---|
| 0 | 0 | 0.0 | 398 |
| 1 | 1 | 0.0 | 210 |
| 2 | 1 | 1.0 | 112 |
| 3 | 0 | 1.0 | 97 |
| 4 | 0 | 2.5 | 39 |
| 5 | 0 | 2.0 | 15 |
| 6 | 1 | 2.0 | 13 |
| 7 | 1 | 2.5 | 7 |
plt.figure(figsize=(8,6))
sns.barplot(data=sv_sibling , x=sv_sibling['Survived'], y=sv_sibling['count'], hue=sv_sibling['SibSp'])
plt.title('Survived & SibSp Frequency')
plt.legend(loc='upper right')
plt.xlabel('Survived')
plt.ylabel('Frequency')
plt.show()
def outlier_detect(df, col):
q1 = df[col].quantile(0.25)
q3 = df[col].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
return df[(df[col] < lower_bound) | (df[col] > upper_bound)]
def outlier_detect_normal(df, col):
mean = df[col].mean()
std_dev = df[col].std()
return df[((df[col] - mean) / std_dev).abs() > 3]
def lower_outlier(df, col):
q1 = df[col].quantile(0.25)
iqr = df[col].quantile(0.75) - q1
lower_bound = q1 - 1.5 * iqr
return df[df[col] < lower_bound]
def upper_outlier(df, col):
q3 = df[col].quantile(0.75)
iqr = q3 - df[col].quantile(0.25)
upper_bound = q3 + 1.5 * iqr
return df[df[col] > upper_bound]
def replace_upper(df, col):
q3 = df[col].quantile(0.75)
iqr = q3 - df[col].quantile(0.25)
upper_bound = q3 + 1.5 * iqr
df[col] = df[col].clip(upper=upper_bound)
print(f'Outliers in column {col} replaced with upper bound ({upper_bound})')
def replace_lower(df, col):
q1 = df[col].quantile(0.25)
iqr = df[col].quantile(0.75) - q1
lower_bound = q1 - 1.5 * iqr
df[col] = df[col].clip(lower=lower_bound)
print(f'Outliers in column {col} replaced with lower bound ({lower_bound})')
Q1 = titanic.quantile(0.25, numeric_only=True)
Q3 = titanic.quantile(0.75, numeric_only=True)
IQR = Q3 - Q1
for i in range(len(continuous_values)):
print("IQR => {}: {}".format(continuous_values[i], outlier_detect(titanic, continuous_values[i]).shape[0]))
print("Z_Score => {}: {}".format(continuous_values[i], outlier_detect_normal(titanic, continuous_values[i]).shape[0]))
print("********************************")
outlier = []
for i in range(len(continuous_values)):
if outlier_detect(titanic[continuous_values],continuous_values[i]).shape[0] !=0:
outlier.append(continuous_values[i])
outlier
for i in range(len(outlier)):
replace_upper(titanic, outlier[i])
print("\n********************************\n")
for i in range(len(outlier)):
replace_lower(titanic, outlier[i])
titanic.isnull().sum()
PassengerId 0 Survived 0 Pclass 0 Name 0 Sex 0 Age 177 SibSp 0 Parch 0 Ticket 0 Fare 0 Cabin 687 Embarked 2 dtype: int64
titanic['Age'].fillna(titanic['Age'].mean(), inplace=True)
titanic['Embarked'].fillna("Q", inplace=True)
titanic.isnull().sum()
PassengerId 0 Survived 0 Pclass 0 Name 0 Sex 0 Age 0 SibSp 0 Parch 0 Ticket 0 Fare 0 Cabin 687 Embarked 0 dtype: int64
titanic.drop(labels=['Cabin', 'Name', 'Ticket', 'PassengerId'], axis=1, inplace=True)
titanic.head()
| Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 1 | 22.0 | 1 | 0 | 7.2500 | 2 |
| 1 | 1 | 1 | 0 | 38.0 | 1 | 0 | 71.2833 | 0 |
| 2 | 1 | 3 | 0 | 26.0 | 0 | 0 | 7.9250 | 2 |
| 3 | 1 | 1 | 0 | 35.0 | 1 | 0 | 53.1000 | 2 |
| 4 | 0 | 3 | 1 | 35.0 | 0 | 0 | 8.0500 | 2 |
titanic.duplicated().sum()
0
selected_columns = [ 'Sex', 'Embarked']
le = LabelEncoder()
for col in selected_columns:
titanic[col] = le.fit_transform(titanic[col])
titanic.head()
| Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 1 | 22.0 | 1 | 0 | 7.2500 | 2 |
| 1 | 1 | 1 | 0 | 38.0 | 1 | 0 | 71.2833 | 0 |
| 2 | 1 | 3 | 0 | 26.0 | 0 | 0 | 7.9250 | 2 |
| 3 | 1 | 1 | 0 | 35.0 | 1 | 0 | 53.1000 | 2 |
| 4 | 0 | 3 | 1 | 35.0 | 0 | 0 | 8.0500 | 2 |
mms = MinMaxScaler()
df1 = titanic.copy(deep=True)
df1[['Age', 'Fare']] = mms.fit_transform(df1[['Age', 'Fare']])
df1.head()
| Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 1 | 0.271174 | 1 | 0 | 0.014151 | 2 |
| 1 | 1 | 1 | 0 | 0.472229 | 1 | 0 | 0.139136 | 0 |
| 2 | 1 | 3 | 0 | 0.321438 | 0 | 0 | 0.015469 | 2 |
| 3 | 1 | 1 | 0 | 0.434531 | 1 | 0 | 0.103644 | 2 |
| 4 | 0 | 3 | 1 | 0.434531 | 0 | 0 | 0.015713 | 2 |
titanic.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 891 entries, 0 to 890 Data columns (total 8 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Survived 891 non-null int64 1 Pclass 891 non-null int64 2 Sex 891 non-null int64 3 Age 714 non-null float64 4 SibSp 891 non-null int64 5 Parch 891 non-null int64 6 Fare 891 non-null float64 7 Embarked 891 non-null int64 dtypes: float64(2), int64(6) memory usage: 55.8 KB
features = df1.drop('Survived', axis=1).values
target = df1['Survived'].values
x_train, x_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
print("Training set features shape:", x_train.shape)
print("Testing set features shape:", x_test.shape)
print("Training set target shape:", y_train.shape)
print("Testing set target shape:", y_test.shape)
Training set features shape: (712, 7) Testing set features shape: (179, 7) Training set target shape: (712,) Testing set target shape: (179,)
# Creating DataFrames for saving to CSV
train_df = pd.DataFrame(x_train, columns=df1.drop('Survived', axis=1).columns)
train_df['Survived'] = y_train
test_df = pd.DataFrame(x_test, columns=df1.drop('Survived', axis=1).columns)
test_df['Survived'] = y_test
# File paths
train_csv_path = 'D:/Projects/Python/CodeSoft Internship/Titanic_Survival Project/train.csv'
test_csv_path = 'D:/Projects/Python/CodeSoft Internship/Titanic_Survival Project/test.csv'
# Saving to CSV
train_df.to_csv(train_csv_path, index=False)
test_df.to_csv(test_csv_path, index=False)
print(f"Training data saved to {train_csv_path}")
print(f"Testing data saved to {test_csv_path}")
Training data saved to D:/Projects/Python/CodeSoft Internship/Titanic_Survival Project/train.csv Testing data saved to D:/Projects/Python/CodeSoft Internship/Titanic_Survival Project/test.csv
def model_evaluation(classifier, x_test, y_test):
# Confusion Matrix
cm = confusion_matrix(y_test, classifier.predict(x_test))
names = ['True Neg', 'False Pos', 'False Neg', 'True Pos']
# Format confusion matrix values
labels = [['{}\n{}'.format(name, value) for name, value in zip(names, row)] for row in cm]
sns.heatmap(cm, annot=labels, fmt='', annot_kws={"size": 14})
plt.title('Confusion Matrix')
plt.show()
# Classification Report
print("\nClassification Report:\n", classification_report(y_test, classifier.predict(x_test)))
# ROC Curve
plot_roc_curve_custom(classifier, x_test, y_test)
def plot_roc_curve_custom(classifier, x_test, y_test):
# Calculate ROC curve
fpr, tpr, _ = roc_curve(y_test, classifier.predict_proba(x_test)[:, 1])
# Plot ROC curve
plt.figure(figsize=(8, 8))
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()
def model(classifier, x_train, y_train, x_test, y_test):
classifier.fit(x_train, y_train)
prediction = classifier.predict(x_test)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# Calculate metrics
accuracy = accuracy_score(y_test, prediction)
precision = precision_score(y_test, prediction)
recall = recall_score(y_test, prediction)
f1 = f1_score(y_test, prediction)
cross_val_score_mean = cross_val_score(classifier, x_train, y_train, cv=cv, scoring='roc_auc').mean()
roc_auc = roc_auc_score(y_test, prediction)
print("Accuracy: {:.2%}".format(accuracy))
print("Precision: {:.2%}".format(precision))
print("Recall: {:.2%}".format(recall))
print("F1 Score: {:.2%}".format(f1))
print("Cross Validation Score: {:.2%}".format(cross_val_score_mean))
print("ROC_AUC Score: {:.2%}".format(roc_auc))
# Evaluation
model_evaluation(classifier, x_test, y_test)
rf = RandomForestClassifier(random_state = 42, n_estimators = 200, max_depth = 4, min_samples_leaf = 2)
model(rf, x_train, y_train, x_test, y_test)
Accuracy: 82.12% Precision: 85.00% Recall: 68.92% F1 Score: 76.12% Cross Validation Score: 86.28% ROC_AUC Score: 80.17%
Classification Report:
precision recall f1-score support
0 0.81 0.91 0.86 105
1 0.85 0.69 0.76 74
accuracy 0.82 179
macro avg 0.83 0.80 0.81 179
weighted avg 0.82 0.82 0.82 179
# Use the best model for predictions on the test set with selected features
y_pred_rf = rf.predict(x_test)
# Create a new DataFrame with selected features and predictions for RandomForestClassifier
result_rf = pd.DataFrame({ 'Actual': y_test, 'Predicted': y_pred_rf })
# Display the result dataframe
result_rf.head()
| Actual | Predicted | |
|---|---|---|
| 709 | 1 | 0 |
| 439 | 0 | 0 |
| 840 | 0 | 0 |
| 720 | 1 | 1 |
| 39 | 1 | 1 |
hist = HistGradientBoostingClassifier(random_state = 0, max_depth = 4, learning_rate = 0.1)
model(hist, x_train, y_train, x_test, y_test)
Accuracy: 82.68% Precision: 83.08% Recall: 72.97% F1 Score: 77.70% Cross Validation Score: 85.93% ROC_AUC Score: 81.25%
Classification Report:
precision recall f1-score support
0 0.82 0.90 0.86 105
1 0.83 0.73 0.78 74
accuracy 0.83 179
macro avg 0.83 0.81 0.82 179
weighted avg 0.83 0.83 0.82 179
# Use the best model for predictions on the test set with selected features
y_pred_h = hist.predict(x_test)
# Create a new DataFrame with selected features and predictions for RandomForestClassifier
result_h = pd.DataFrame({ 'Actual': y_test, 'Predicted': y_pred_h })
# Display the result dataframe
result_h.head()
| Actual | Predicted | |
|---|---|---|
| 709 | 1 | 0 |
| 439 | 0 | 0 |
| 840 | 0 | 0 |
| 720 | 1 | 1 |
| 39 | 1 | 1 |
gb = GradientBoostingClassifier(random_state = 0, max_depth = 4)
model(gb, x_train, y_train, x_test, y_test)
Accuracy: 81.01% Precision: 81.25% Recall: 70.27% F1 Score: 75.36% Cross Validation Score: 85.66% ROC_AUC Score: 79.42%
Classification Report:
precision recall f1-score support
0 0.81 0.89 0.85 105
1 0.81 0.70 0.75 74
accuracy 0.81 179
macro avg 0.81 0.79 0.80 179
weighted avg 0.81 0.81 0.81 179
Best Model by Accuracy:
Best Model by Cross Validation Score:
def predict_survival(model, input_data):
# Convert input data to a list
input_data_as_list = list(input_data)
# Reshape the list as we are predicting for only one instance
input_data_reshaped = [input_data_as_list]
# Make prediction using the model
prediction = model.predict(input_data_reshaped)
# Return the prediction
return prediction[0]
# Function to take input from the user
def get_user_input():
pclass = int(input("Enter Passenger Class (1, 2, or 3): "))
sex = int(input("Enter Sex (0 for female, 1 for male): "))
age = float(input("Enter Age: "))
sibsp = int(input("Enter Number of Siblings/Spouses Aboard: "))
parch = int(input("Enter Number of Parents/Children Aboard: "))
fare = float(input("Enter Fare: "))
embarked = int(input("Enter Port of Embarkation (0 for Cherbourg, 1 for Queenstown, 2 for Southampton): "))
# Convert all input data to a list
input_data_as_list = [pclass, sex, age, sibsp, parch, fare, embarked]
return input_data_as_list
input_data = get_user_input()
result = predict_survival(model, input_data)
# Print results
print("\nIndividual 1:", "Survived" if result == 1 else "Not Survived")
Enter Passenger Class (1, 2, or 3): 1 Enter Sex (0 for female, 1 for male): 1 Enter Age: 0.66 Enter Number of Siblings/Spouses Aboard: 1 Enter Number of Parents/Children Aboard: 0 Enter Fare: 0.81 Enter Port of Embarkation (0 for Cherbourg, 1 for Queenstown, 2 for Southampton): 2 Individual 1: Not Survived